A colleague asked for assistance creating some synthetic tabular data for a project today, and seemed to appreciate the advice I gave. On reflection I've spent quite a considerable amount of time working with synthetic data, so it seemed like a good opportunity to share what I've learned.
I work with data lakes, databases and various other data technologies every day. Apologies if you have come to this post looking for tips on creating synthetic images for your AI projects - I'm just focusing on tabular data today.
Generating data is really easy...and then suddenly much harder
When I have a data generation project, I quickly try to work out the complexity requirements. I informally separate projects into two categories.
Category 1: Test Data
- These projects can be satisfied with really basic "junk" data.
- Simple data structures
- No need for business rules in the data for it to be useful
Example
If I needed to test a simple "Create Customer" API, then a simple script to generate fake names etc. will probably suffice. It doesn't matter if I use names like "Donald Duck" or "Mickey Mouse" - it may not even matter if I reuse the same data repeatedly. In this case I'm simply trying to ensure that the API doesn't break when I send it a well-formatted message, so the quality of the data is largely irrelevant.
The Faker Python library is a really handy tool for this kind of work. Lots of more advanced synthetic data tools use this library. It's limited, but really useful for generating fake but realistic names.
from faker import Faker
import json

fake = Faker('en_GB')  # Using UK locale for British names

# Generate 10 test customer records
customers = []
for i in range(1, 11):
    customer = {
        'ID': f'P{i:04d}',
        'Surname': fake.last_name(),
        'Forename': fake.first_name(),
        'Date_of_Birth': fake.date_of_birth(minimum_age=18, maximum_age=90).strftime('%Y-%m-%d')
    }
    customers.append(customer)

# Serialise the full list to JSON
json_output = json.dumps(customers, indent=2)
print(json_output)
I also like to use Mockaroo for simple test generation cases. The free version allows you to generate 1000 records, and there are lots of options available for generating fake data for categories like cars, company names, or even animal names. It's affiliated with Tonic, which is a bigger, commercial offering, but I haven't had the opportunity to try that one.
Is AI any good?
It's definitely worth a try if you genuinely want some really basic test data. You could write a simple prompt to create a few dummy records and get a successful result.
The fact that LLMs are so good at writing code these days makes me more inclined to use them to generate code like the example above. That way, you can more easily re-run, tweak and experiment, which is especially useful on longer-running projects.
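For example, a prompt along these lines (purely illustrative) will usually get you a working script back:
"Write a Python script that uses the Faker library to generate 100 UK customer records (ID, forename, surname and an adult date of birth) and print them as a JSON array."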
Category 2: Synthetic Data
- Realistic, high-quality data is required for the project
- The data is relational, introducing extra complexity to data generation
- The generated data needs to implement business rules in order to be useful
I'm separating out these two categories because it helps set expectations for a project. We don't want to over-complicate things if we have a straightforward problem to solve, but we also don't want to underestimate the effort involved in creating a robust, reusable process for more complex scenarios.
Example
I am testing a Credit Card Application API, and want to generate some data to train end users. In this case, sending "junk" data is not helpful. If the credit card application form requires a field called "employment type", then the data I provide must match a valid employment type, or the web application will error. If I provide a "date of birth" field, it must be valid for someone over 18 years of age or the web application will complain. A quick sketch of what that looks like in code follows below.
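As a rough sketch of what encoding those rules means in practice (the field names and the list of employment types below are invented for illustration - substitute whatever your application actually validates):

import random
from faker import Faker

fake = Faker('en_GB')

# Hypothetical list of values the application form will accept
VALID_EMPLOYMENT_TYPES = ['Full-time', 'Part-time', 'Self-employed', 'Retired']

application = {
    'Forename': fake.first_name(),
    'Surname': fake.last_name(),
    # Must be a value the form recognises, not random text
    'Employment_Type': random.choice(VALID_EMPLOYMENT_TYPES),
    # Must describe someone at least 18 years old
    'Date_of_Birth': fake.date_of_birth(minimum_age=18, maximum_age=90).strftime('%Y-%m-%d'),
}
print(application)

Even these two rules push us beyond "junk" data - every extra rule adds another branch of logic to the generator.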
Does this distinction matter? Not a gigantic amount, but hopefully it helps emphasize the point about how quickly data generation can gain complexity due to even simple business rules.
Generating data from an existing source
For projects requiring a synthetic copy of a database (usually to provide data for a dev or test environment), pseudonymising or anonymising a copy or subset of production data can be a relatively simple approach.
The distinction between pseudonymisation and anonymisation can be really important here. Some highly regulated industries can be very particular about what constitutes truly anonymised data, so it's worth checking before adopting this as a general approach.
If you do decide to adopt this tactic, then you will likely end up with some custom code for each database. It's difficult to generalise this kind of work, so a dedicated commercial product like the aforementioned Tonic or Red Gate Test Data Manager might be the correct solution here.
I suspect that a lot of organisations end up with a mass of hand-rolled scripts!
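A typical hand-rolled script looks something like the sketch below. It assumes a pandas extract with Forename, Surname and Email columns - the column names and file paths are purely illustrative:

import hashlib
import pandas as pd
from faker import Faker

fake = Faker('en_GB')

def pseudonymise(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Replace direct identifiers with fake values
    df['Forename'] = [fake.first_name() for _ in range(len(df))]
    df['Surname'] = [fake.last_name() for _ in range(len(df))]
    # Replace the email with a stable, non-reversible token so joins still work
    df['Email'] = df['Email'].map(
        lambda e: hashlib.sha256(e.encode()).hexdigest()[:12] + '@example.com'
    )
    return df

masked = pseudonymise(pd.read_csv('customers_extract.csv'))
masked.to_csv('customers_masked.csv', index=False)

Every database ends up needing its own version of this, which is exactly why the commercial tools exist.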
Databricks Labs dbldatagen
For more advanced synthetic data generation, I've had a decent amount of success with the Databricks Labs Data Generator (dbldatagen).
This is definitely more powerful and flexible than Faker. You can specify ranges, pick values from a list, and it supports multi-table datasets as well.
This code snippet (taken from the documentation) shows how to create a random purchase date within a range, and specify a return date that is later than the purchase date. It's a good example of how to handle some of the business rules we commonly find in actual datasets.
import dbldatagen as dg
from pyspark.sql.types import IntegerType

row_count = 1000 * 100
testDataSpec = (
    dg.DataGenerator(spark, name="test_data_set1", rows=row_count, partitions=4,
                     randomSeedMethod="hash_fieldname", verbose=True)
    .withColumn("purchase_id", IntegerType(), minValue=1000000, maxValue=2000000)
    .withColumn("product_code", IntegerType(), uniqueValues=10000, random=True)
    .withColumn(
        "purchase_date",
        "date",
        data_range=dg.DateRange("2017-10-01 00:00:00", "2018-10-06 11:55:00", "days=3"),
        random=True,
    )
    .withColumn(
        "return_date",
        "date",
        expr="date_add(purchase_date, cast(floor(rand() * 100 + 1) as int))",
        baseColumn="purchase_date",
    )
)
dfTestData = testDataSpec.build()
At the time of writing, dbldatagen is still actively maintained. I particularly like the ability to specify distributions for datasets. This is really helpful when modelling a national health population, for example. In fact, any scenario where we need to provide synthetic data for analysts or data scientists is likely to be improved by accurately modelling distributions.
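As a rough sketch based on my reading of the dbldatagen docs (the column names and ranges are invented, and it assumes the same Spark session as the snippet above), picking values from a weighted list and drawing a numeric column from a normal distribution looks something like this:

import dbldatagen as dg

# Hypothetical patient dataset for illustration only
patientSpec = (
    dg.DataGenerator(spark, name="patients", rows=100000, partitions=4)
    # Categorical column drawn from a weighted list of values
    .withColumn("region", "string",
                values=["North", "South", "East", "West"],
                weights=[4, 3, 2, 1], random=True)
    # Numeric column drawn from a normal distribution within a range
    .withColumn("systolic_bp", "int",
                minValue=90, maxValue=200,
                distribution="normal", random=True)
)
dfPatients = patientSpec.build()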
Statistical Methods and Deep Learning
To recap, I have covered the creation of basic test data for load testing or straightforward testing scenarios. We have also looked at more sophisticated scenarios that require realistic values, accurate distributions, and support for complex business rules.
I briefly discussed generating synthetic data from an existing source by:
- Restoring a copy of production data
- Running a script to pseudonymise or anonymise the data
Another method for generating high-quality data from an existing source is to use a tool like the Synthetic Data Vault (SDV).
SDV is an open source project maintained by DataCebo. DataCebo also has a commercial offering, but I haven't had the opportunity to use it.
SDV is actually remarkably easy to use. Below is an excerpt from the SDV documentation. The link is to a Colab notebook, so you can run the code and get a feel for how it hangs together.
# This excerpt assumes we have already read in a pandas DataFrame called real_data
# It also assumes we have run: metadata = Metadata.detect_from_dataframe(real_data)
# The metadata object contains information about the data types in the real_data
# DataFrame, and is used by the synthesizer
from sdv.single_table import CTGANSynthesizer

synthesizer = CTGANSynthesizer(metadata)
synthesizer.fit(real_data)

# Generate the new, synthetic dataset
synthetic_data = synthesizer.sample(num_rows=500)
synthetic_data.head()
That's it!
Although the basic usage is quite straightforward, there are plenty more features you'll need to complete a real-life project.
SDV has the ability to report on the similarities between the synthetic data and the original dataset - this is really useful for analysing the quality of the newly generated data.
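If I remember the 1.x API correctly, this is exposed via the evaluation module - a minimal sketch, reusing the real_data, synthetic_data and metadata objects from the excerpt above:

from sdv.evaluation.single_table import evaluate_quality

# Compares column shapes and pairwise relationships between real and synthetic data
quality_report = evaluate_quality(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata
)
print(quality_report.get_score())  # overall score between 0 and 1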
It also has the ability to manage business rules via its constraints feature.
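As a hedged sketch of what a constraint looks like (the column names are borrowed from the earlier purchase/return example, and the exact constraint classes and parameters may differ between SDV versions, so check the docs for yours):

# Require that return_date is always later than purchase_date
date_order_constraint = {
    'constraint_class': 'Inequality',
    'constraint_parameters': {
        'low_column_name': 'purchase_date',
        'high_column_name': 'return_date'
    }
}
synthesizer.add_constraints(constraints=[date_order_constraint])
synthesizer.fit(real_data)  # constraints must be added before fitting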
Finally, you can experiment with a range of different synthesizer options - from the more basic GaussianCopula synthesizer to more advanced options like CTGAN and TVAE. These have a significant impact on the amount of compute required, so it's definitely worth investigating.
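Swapping synthesizers is a small change - for example, assuming the same metadata and real_data objects as before:

from sdv.single_table import GaussianCopulaSynthesizer

# Much cheaper to fit than CTGAN or TVAE, at the cost of some fidelity
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)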
In fact, the main drawback I've found with this approach to generating synthetic data is the amount of compute it needs and the time it can take to run. This will be a significant factor if you need to work with very large datasets.
Other Options
NeMo Data Designer from NVIDIA is another option I've looked at. NVIDIA acquired Gretel.ai in April 2025, and it looks as if they've quickly incorporated parts of that product into the NVIDIA NeMo AI agent software suite.
I went through the basic tutorials quite easily, and it has a similar set of features, but it's clearly optimised for use with NVIDIA's agent hosting services, so it might be difficult to scale out if you aren't already using that ecosystem.
Summary
The ability to create high-quality synthetic data is crucial for high-performing engineering teams across a wide range of industries and domains.
Concerns over customer data security can create real challenges for software development lifecycles, but testing with unrealistic data can also cause real headaches and introduce unnecessary bugs and issues.
Teams with access to great synthetic data can have a real advantage when it comes to fast release cycles, and this in turn encourages experimentation and creativity.
In the past, an organisation I worked for was unable to utilise off-shore engineering resource because of legal issues around data residency and access - reliable synthetic data would have mitigated this risk.
Finally, high quality synthetic datasets can enable research scenarios - especially in Healthcare, where the use of genuine data can be prohibitively complex or expensive.
I hope this helps provide a short introduction to the topic. Good luck with all of your future projects.